Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add voiceactivity action. #333

Merged
merged 7 commits into from
Jul 18, 2024
Merged

Add voiceactivity action. #333

merged 7 commits into from
Jul 18, 2024

Conversation

jianjunz
Copy link
Member

@jianjunz jianjunz commented Jun 28, 2024

This change adds support for the voice activity detection (VAD) feature for microphones. It allows applications to show a notification when the user is speaking while the MediaStreamTrack is muted.

Related discussion in mediacapture-extensions spec: issue, pr, slide


Preview | Diff

This change adds support for the voice activity detection (VAD) feature
for microphones. It allows application to show a notification when user
is speaking but MediaStreamTrack is muted.
Copy link
Contributor

@youennf youennf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments below, looks good overall to me.

We might want to add some text related to toggleMicrophone/Camera/voiceactivity source's target so that these actions are called on MediaSession objects whose context has capture MediaStreamTracks. This should be done in a separate PR though.

index.bs Outdated
<p>
A user agent MUST invoke {{MediaSessionActionHandler}} for
{{MediaSessionAction/voiceactivity}} only when the voice activity is
detected from a source with one or more live {{MediaStreamTrack}}s. A user
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to state that this is restricted to microphone tracks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed "source" to "microphone".

@@ -1496,6 +1513,7 @@ parameter whose dictionary type is:
<li>{{MediaSessionActionDetails}} for {{MediaSessionAction/nextslide}}.</li>
<li>{{MediaSessionActionDetails}} for
{{MediaSessionAction/enterpictureinpicture}}.</li>
<li>{{MediaSessionActionDetails}} for {{MediaSessionAction/voiceactivity}}.</li>
</ul>

The <dfn dict-member for="MediaSessionActionDetails">action</dfn>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could an example that links voice activity with displaying a UI that can execute setMicrophoneActive(true)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@jianjunz
Copy link
Member Author

We might want to add some text related to toggleMicrophone/Camera/voiceactivity source's target so that these actions are called on MediaSession objects whose context has capture MediaStreamTracks. This should be done in a separate PR though.

I also prefer a separate PR for toggleMicrophone/Camera, so this PR doesn't change anything other than voiceactivity action.

@chrisn chrisn added the agenda Topic should be discussed in a group call label Jul 1, 2024
Copy link
Member

@chrisn chrisn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments inline. My bigger question is why expose this via Media Session and not on the media stream track? Also need to think about the privacy implications of allowing pages to detect voice activity on muted tracks.

index.bs Outdated
Comment on lines 417 to 418
the action's intent is to notify the action handler that a voice
activity is started.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the following be a bit clearer?

Suggested change
the action's intent is to notify the action handler that a voice
activity is started.
the action's intent is to notify the web page that voice
activity has been detected by the microphone.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note to make it clearer.

index.bs Outdated
@@ -541,6 +546,17 @@ platform UI or media keys, thereby improving the user experience.
{{MediaSessionActionHandler}} before running, as different tasks, the
steps defined to [$set a track's muted state$].
</p>
<p>
A user agent MUST invoke {{MediaSessionActionHandler}} for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
A user agent MUST invoke {{MediaSessionActionHandler}} for
A user agent MUST invoke the {{MediaSessionActionHandler}} for

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
@@ -541,6 +546,17 @@ platform UI or media keys, thereby improving the user experience.
{{MediaSessionActionHandler}} before running, as different tasks, the
steps defined to [$set a track's muted state$].
</p>
<p>
A user agent MUST invoke {{MediaSessionActionHandler}} for
{{MediaSessionAction/voiceactivity}} only when the voice activity is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{{MediaSessionAction/voiceactivity}} only when the voice activity is
{{MediaSessionAction/voiceactivity}} only when voice activity is

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
user agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if all
{{MediaStreamTrack}}s associated with the source are not
{{MediaStreamTrack/muted}}. It is RECOMMENDED for user agents to set a
minimal interval for invoking {{MediaSessionActionHandler}} for
Copy link
Member

@chrisn chrisn Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest clarifying "minimal interval" - is this a time delay after voice activity is detected before the action handler is invoked? And how long does voice activity need to be present for? And does voice activity that comes and goes cause multiple invocations? (Not suggesting we spec these things, just clarify what we're recommending user agents to consider)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest clarifying "minimal interval" - is this a time delay after voice activity is detected before the action handler is invoked?

It's intended to be a minimal interval between two voiceactivity actions (or events? event sounds to be more accurate here).

And how long does voice activity need to be present for? And does voice activity that comes and goes cause multiple invocations?

It actually depends on the voice activity detection (VAD) algorithm. Sometimes VAD algorithm may even consider background noise as a voice activity. Based on the use case (unmute microphone notification) we want to target, it is recommended not invoking this action handler too frequently.

A new note section is added for some background about this action.

@youennf
Copy link
Contributor

youennf commented Jul 1, 2024

My bigger question is why expose this via Media Session and not on the media stream track?

Good question.
The goal of the action is to allow the web page to ask the user whether they want to unmute microphone capture.
This is done via MediaSession setMicrophoneActive.
Hence why it is best placed at MediaSession level.

There is some desire to expose whether there is voice info in live microphone capture.
This is a different use case that seems best resolved at MediaStreamTrack level, not via MediaSession.

Also need to think about the privacy implications of allowing pages to detect voice activity on muted tracks.

Again, good question.
First, OSes like macOS or iOS added support for muted talker detection (capture is muted but there is still the possibility to know whether user might be talking). This is a good sign that the same functionality is desirable for web applications.

If web applications are not able to help users unmute themselves, web applications will not ask to be muted.
They will instead keep microphone capture running just to detect muted talkers, mute meaning in that case to not send any audio. So we are kept in a world where OS/UA tells microphone is on but Web application tells microphone is off.

FWIW, in Safari, there is a capture indicator telling whether capture is running (red icon), capture is muted (bar gray icon) or capture is not running (no icon). This new action would only happen in the second case, not in the third one (much more problematic).

@youennf
Copy link
Contributor

youennf commented Jul 1, 2024

Thanks for putting that to the agenda.

Copy link
Member Author

@jianjunz jianjunz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Chris, thanks for your review.

Also thanks to Youenn for providing detailed info about this action, and OS support.

index.bs Outdated
@@ -541,6 +546,17 @@ platform UI or media keys, thereby improving the user experience.
{{MediaSessionActionHandler}} before running, as different tasks, the
steps defined to [$set a track's muted state$].
</p>
<p>
A user agent MUST invoke {{MediaSessionActionHandler}} for
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
@@ -541,6 +546,17 @@ platform UI or media keys, thereby improving the user experience.
{{MediaSessionActionHandler}} before running, as different tasks, the
steps defined to [$set a track's muted state$].
</p>
<p>
A user agent MUST invoke {{MediaSessionActionHandler}} for
{{MediaSessionAction/voiceactivity}} only when the voice activity is
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
user agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if all
{{MediaStreamTrack}}s associated with the source are not
{{MediaStreamTrack/muted}}. It is RECOMMENDED for user agents to set a
minimal interval for invoking {{MediaSessionActionHandler}} for
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest clarifying "minimal interval" - is this a time delay after voice activity is detected before the action handler is invoked?

It's intended to be a minimal interval between two voiceactivity actions (or events? event sounds to be more accurate here).

And how long does voice activity need to be present for? And does voice activity that comes and goes cause multiple invocations?

It actually depends on the voice activity detection (VAD) algorithm. Sometimes VAD algorithm may even consider background noise as a voice activity. Based on the use case (unmute microphone notification) we want to target, it is recommended not invoking this action handler too frequently.

A new note section is added for some background about this action.

Copy link
Contributor

@steimelchrome steimelchrome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable to me

Copy link
Member

@chrisn chrisn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more editorial suggestions, but overall this looks good. Thank you!

index.bs Outdated
@@ -412,6 +412,11 @@ platform UI or media keys, thereby improving the user experience.
the action's intent is to open the media session in a
picture-in-picture window.
</li>
<li>
<dfn enum-value for=MediaSessionAction>voiceactivity</dfn>:
the action's intent is to notify the web page that a voice activity
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
the action's intent is to notify the web page that a voice activity
the action's intent is to notify the web page that voice activity

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if all
{{MediaStreamTrack}}s associated with the source are not
{{MediaStreamTrack/muted}}. It is RECOMMENDED for user agents to set a
minimal interval for invoking {{MediaSessionActionHandler}} for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
minimal interval for invoking {{MediaSessionActionHandler}} for
minimal interval between invocations of the {{MediaSessionActionHandler}} for

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
</p>

<p class=note>
{{MediaSessionAction/voiceactivity}} only indicates the start of a voice
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{{MediaSessionAction/voiceactivity}} only indicates the start of a voice
{{MediaSessionAction/voiceactivity}} only indicates the start of voice

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated

<p class=note>
{{MediaSessionAction/voiceactivity}} only indicates the start of a voice
activity. Application may display a notification if the user is speaking
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
activity. Application may display a notification if the user is speaking
activity. Applications may display a notification if the user is speaking

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
{{MediaSessionAction/voiceactivity}} only indicates the start of a voice
activity. Application may display a notification if the user is speaking
while the {{MediaStreamTrack}} is muted, or start an {{AudioWorklet}} for
audio processing. No action is defined for the end of a voice activity.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
audio processing. No action is defined for the end of a voice activity.
audio processing. No action is defined for the end of voice activity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
activity. Application may display a notification if the user is speaking
while the {{MediaStreamTrack}} is muted, or start an {{AudioWorklet}} for
audio processing. No action is defined for the end of a voice activity.
Unlike other actions which are explicitely triggered by the user,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Unlike other actions which are explicitely triggered by the user,
Unlike other actions which are explicitly triggered by the user,

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
Comment on lines 569 to 571
efficiency concern, web page may not be notified if the second voice
activity started soon after last {{MediaSessionAction/voiceactivity}}
action.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
efficiency concern, web page may not be notified if the second voice
activity started soon after last {{MediaSessionAction/voiceactivity}}
action.
efficiency concerns, the web page may not be notified if voice
activity ends and restarts soon after the last {{MediaSessionAction/voiceactivity}}
action.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

jianjunz added 2 commits July 12, 2024 10:54
MediaStreamTrack.muted is a readonly attribute. Replace it with enabled,
which can be set by the application.
Copy link
Member Author

@jianjunz jianjunz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments.

index.bs Outdated
@@ -412,6 +412,11 @@ platform UI or media keys, thereby improving the user experience.
the action's intent is to open the media session in a
picture-in-picture window.
</li>
<li>
<dfn enum-value for=MediaSessionAction>voiceactivity</dfn>:
the action's intent is to notify the web page that a voice activity
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if all
{{MediaStreamTrack}}s associated with the source are not
{{MediaStreamTrack/muted}}. It is RECOMMENDED for user agents to set a
minimal interval for invoking {{MediaSessionActionHandler}} for
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
</p>

<p class=note>
{{MediaSessionAction/voiceactivity}} only indicates the start of a voice
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated

<p class=note>
{{MediaSessionAction/voiceactivity}} only indicates the start of a voice
activity. Application may display a notification if the user is speaking
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
{{MediaSessionAction/voiceactivity}} only indicates the start of a voice
activity. Application may display a notification if the user is speaking
while the {{MediaStreamTrack}} is muted, or start an {{AudioWorklet}} for
audio processing. No action is defined for the end of a voice activity.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
activity. Application may display a notification if the user is speaking
while the {{MediaStreamTrack}} is muted, or start an {{AudioWorklet}} for
audio processing. No action is defined for the end of a voice activity.
Unlike other actions which are explicitely triggered by the user,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
Comment on lines 569 to 571
efficiency concern, web page may not be notified if the second voice
activity started soon after last {{MediaSessionAction/voiceactivity}}
action.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
from a microphone with one or more live {{MediaStreamTrack}}s. A user
agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if
microphone is not muted and all {{MediaStreamTrack}}s associated with the
source are {{MediaStreamTrack/enabled}}. It is RECOMMENDED for user agents
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced MediaStreamTrack.muted with MediaStreamTrack.enabled because muted is a readonly property. Application may want to set enabled to true if voice activity is detected.

Copy link
Contributor

@youennf youennf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.
I think we can add some wording stating the voiceactivity action source MUST always have a target whose document MUST always have live microphone {{MediaStreamTrack}}s. This will more closely reuse spec wording and tighten the fact that voiceacitvity cannot happen on a document that is not capturing microphone

Copy link
Member

@chrisn chrisn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, a few more minor comments.

index.bs Outdated
{{MediaSessionAction/voiceactivity}} only when voice activity is detected
from a microphone with one or more live {{MediaStreamTrack}}s. A user
agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if
microphone is not muted and all {{MediaStreamTrack}}s associated with the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
microphone is not muted and all {{MediaStreamTrack}}s associated with the
the microphone is not muted and all {{MediaStreamTrack}}s associated with the

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
<p>
A user agent MUST invoke the {{MediaSessionActionHandler}} for
{{MediaSessionAction/voiceactivity}} only when voice activity is detected
from a microphone with one or more live {{MediaStreamTrack}}s. A user
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If "live" here is the same as {{MediaStreamTrackState/live}}, we could link them:

Suggested change
from a microphone with one or more live {{MediaStreamTrack}}s. A user
from a microphone with one or more {{MediaStreamTrackState/live}} {{MediaStreamTrack}}s. A user

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
A user agent MUST invoke the {{MediaSessionActionHandler}} for
{{MediaSessionAction/voiceactivity}} only when voice activity is detected
from a microphone with one or more live {{MediaStreamTrack}}s. A user
agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Ignore" seems strange here and implies the action is generated somehow, but ignored. Could we say "A user agent MAY ignore voice activity if the microphone is not muted ..."?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. It sounds better.

@jianjunz
Copy link
Member Author

This looks good to me. I think we can add some wording stating the voiceactivity action source MUST always have a target whose document MUST always have live microphone {{MediaStreamTrack}}s. This will more closely reuse spec wording and tighten the fact that voiceacitvity cannot happen on a document that is not capturing microphone

Thanks for the suggestion. It's added to the beginning of the paragraph.

Copy link
Member

@chrisn chrisn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you!

Copy link
Member Author

@jianjunz jianjunz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you!

Thanks for your review. Please help merge it if it's okay to merge.

index.bs Outdated
A user agent MUST invoke the {{MediaSessionActionHandler}} for
{{MediaSessionAction/voiceactivity}} only when voice activity is detected
from a microphone with one or more live {{MediaStreamTrack}}s. A user
agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. It sounds better.

index.bs Outdated
{{MediaSessionAction/voiceactivity}} only when voice activity is detected
from a microphone with one or more live {{MediaStreamTrack}}s. A user
agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if
microphone is not muted and all {{MediaStreamTrack}}s associated with the
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

index.bs Outdated
<p>
A user agent MUST invoke the {{MediaSessionActionHandler}} for
{{MediaSessionAction/voiceactivity}} only when voice activity is detected
from a microphone with one or more live {{MediaStreamTrack}}s. A user
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@chrisn
Copy link
Member

chrisn commented Jul 15, 2024

Thanks! As they're the editors, I leave @youennf or @steimelchrome to merge.

@youennf
Copy link
Contributor

youennf commented Jul 18, 2024

Merging given comments have been addressed.

@youennf youennf merged commit 0f6e693 into w3c:main Jul 18, 2024
2 checks passed
github-actions bot added a commit that referenced this pull request Jul 18, 2024
SHA: 0f6e693
Reason: push, by youennf

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@chrisn chrisn removed the agenda Topic should be discussed in a group call label Dec 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider adding onVoiceActivity event on MediaStreamTrack for audio
4 participants